Statistics 2: Probabilities, Distributions, and Tests

Notebook Summary

This notebook is a series of exercises for practicing probabilities, distributions, and statistical tests by answering questions about Titanic passenger data. Each section below has a header stating the question, followed by the calculations and a written summary of the result.

Importing data file



In [1]:

    
import numpy as np
import pandas as pd

titanic_data = pd.read_csv('train.csv')
titanic_data.head(5)









    Out[1]:

   PassengerId  Survived  Pclass  Name                                               Sex      Age  SibSp  Parch  Ticket               Fare  Cabin  Embarked
0            1         0       3  Braund, Mr. Owen Harris                            male    22.0      1      0  A/5 21171          7.2500  NaN    S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  PC 17599          71.2833  C85    C
2            3         1       3  Heikkinen, Miss. Laina                             female  26.0      0      0  STON/O2. 3101282   7.9250  NaN    S
3            4         1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0      1      0  113803            53.1000  C123   S
4            5         0       3  Allen, Mr. William Henry                           male    35.0      0      0  373450             8.0500  NaN    S

Cleaning and filling data

Checking to see what columns need to be filled using the .info() method.



In [2]:

    
titanic_data.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

All NaN ages are filled with the mean age, and the result is confirmed with the .info() method. Later sections compensate for this by filtering the mean-filled values back out wherever they would distort the result. Each such step is marked with 'COMPENSATION:' and explained where it occurs.



In [3]:

    
titanic_data.Age = titanic_data.Age.fillna(np.mean(titanic_data.Age))



In [4]:

    
titanic_data.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
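
A minimal sketch of an alternative bookkeeping approach, which is not what this notebook does: record which ages were missing before filling, so the filled rows can be excluded later without comparing rounded floats. The toy DataFrame, its values, and the names `toy`, `age_was_missing`, and `recorded_ages` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame (hypothetical values, not real passengers).
toy = pd.DataFrame({"Age": [22.0, np.nan, 38.0, np.nan, 30.0]})

# Remember which rows had no recorded age BEFORE filling.
age_was_missing = toy["Age"].isna()

# Fill the gaps with the mean of the recorded ages (Series.mean skips NaN).
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

# Recover only the originally recorded ages when an analysis needs them.
recorded_ages = toy.loc[~age_was_missing, "Age"]

print(toy["Age"].tolist())
print(recorded_ages.tolist())
```

In this toy data one recorded age happens to equal the mean. The mask keeps it, whereas matching on the rounded mean would drop it along with the filled values.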

Question 1

Calculate the probability of survival during the Titanic crash.



In [5]:

    
survivors = titanic_data[titanic_data.Survived == 1]
# Fraction of passengers in the data set who survived (a probability between 0 and 1, not a percentage).
survivor_prob = len(survivors) / len(titanic_data)
print("The probability of survival is " + str(survivor_prob) + ".")



The probability of survival is 0.3838383838383838.

Question 2

CHOOSE TWO OF...

A passenger was male
A passenger was female and had at least 1 SibSp on board
A survivor was from Cherbourg

Our first choice of question is to see the probability that a passenger was male.



In [6]:

    
male_passenger = titanic_data[titanic_data.Sex == 'male']
# Fraction of all passengers who were male (a probability between 0 and 1, not a percentage).
prob_male = len(male_passenger) / len(titanic_data)
print("The probability that a passenger was male is " + str(prob_male) + ".")



The probability that a passenger was male is 0.6475869809203143.

Our second choice of question is to find the probability that a survivor was from Cherbourg.



In [7]:

    
c_port = survivors[survivors.Embarked == 'C']
# Fraction of survivors who embarked at Cherbourg (a probability between 0 and 1, not a percentage).
prob_c = len(c_port) / len(survivors)
print("The probability that a survivor was from Cherbourg is " + str(prob_c) + ".")



The probability that a survivor was from Cherbourg is 0.2719298245614035.
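
The Cherbourg calculation is a conditional probability, P(Cherbourg | survived): it restricts to survivors first and then takes the share who embarked at 'C'. A minimal sketch on made-up data, assuming the same `Survived` and `Embarked` column names as train.csv; the `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical mini data set using the same column names as train.csv.
toy = pd.DataFrame({
    "Survived": [1, 1, 1, 1, 0, 0, 0, 0],
    "Embarked": ["C", "C", "S", "Q", "S", "S", "C", "Q"],
})

# Marginal probability: the mean of a boolean mask is the fraction of True values.
p_cherbourg = (toy["Embarked"] == "C").mean()

# Conditional probability: restrict to survivors, then take the same kind of mean.
survived = toy["Survived"] == 1
p_cherbourg_given_survived = (toy.loc[survived, "Embarked"] == "C").mean()

print(p_cherbourg)                 # share of all passengers from Cherbourg
print(p_cherbourg_given_survived)  # share of survivors from Cherbourg
```

The two numbers differ whenever survival and port of embarkation are not independent in the data, which is why the denominator has to be the survivors rather than all passengers.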

Question 3

Plot the distribution of passenger ages. (Bins = 25)

COMPENSATION: Earlier, all missing ages were replaced with the mean. Because those filled values would create an artificial spike at the center of the histogram, each passenger age is rounded to three decimal places and compared against the mean rounded to three decimal places; ages that match are left out. With the filled values removed, the histogram reflects only recorded ages. (This is what the for loop below does.)



In [8]:

    
import matplotlib.pyplot as plt
%matplotlib inline
all_ages = []
age_mean = np.mean(titanic_data.Age)

# Keep only recorded ages: skip values equal to the mean that was used to fill NaNs.
for age in titanic_data.Age:
    if round(age, 3) != round(age_mean, 3):
        all_ages.append(age)

H, edges = np.histogram(all_ages, bins=25)

ax = plt.subplot(111)
# edges[:-1] holds the left edge of each bin, so the bars are aligned on their left edges.
ax.bar(edges[:-1], H / float(sum(H)), width=edges[1] - edges[0], align='edge')
ax.set_xlabel("Passenger Age")
ax.set_ylabel("Relative Frequency")
ax.minorticks_on()
plt.show()
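
A short sketch of the normalization used for the bar heights: dividing the bin counts by their total turns them into relative frequencies that sum to 1. The `ages` values are made up for illustration; only the use of `np.histogram` is taken from the cell above.

```python
import numpy as np

# Hypothetical ages, used only to illustrate the normalization step.
ages = np.array([2, 4, 19, 22, 22, 25, 31, 35, 38, 54, 61, 80], dtype=float)

counts, edges = np.histogram(ages, bins=4)

# Relative frequency per bin: each count divided by the number of observations.
rel_freq = counts / counts.sum()

# np.histogram returns one more edge than there are bins.
print(len(counts), len(edges))
print(rel_freq)
```

Because `edges` has one more entry than `counts`, a bar chart built from it needs either the left edges (`edges[:-1]`) or the bin centers as x positions.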

Question 4

Find the probability that a passenger was less than 10 years old.



In [9]:

    
# all_ages holds only the recorded ages, so the mean-filled values are excluded from the denominator.
less_than_ten = []
for i in all_ages:
    if i < 10:
        less_than_ten.append(i)

prob_less_than_ten = len(less_than_ten) / len(all_ages)
print("There is a " + str(round(prob_less_than_ten, 3)) + " probability that a passenger was less than 10 years old.")



There is a 0.087 probability that a passenger was less than 10 years old.

Question 5

Given 100 passengers at random, determine the probability that exactly 42 passengers survive.



In [10]:

    
from scipy.stats import binom
binom.pmf(42, 100, survivor_prob)









    Out[10]:





0.061330411815167886

There is a 0.0613 probability that exactly 42 passengers survive out of 100. See the 'Out' value above for a more precise probability.
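
For reference, the binomial probability mass function that `binom.pmf` evaluates is P(X = k) = C(n, k) p^k (1 - p)^(n - k). A minimal sketch that recomputes it with the standard library, assuming a survival probability of 342/891 (the fraction that matches the 0.3838... printed in Question 1); the helper name `binomial_pmf` is hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), computed straight from the formula."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Assumption: 342 survivors out of 891 passengers, i.e. the 0.3838... from Question 1.
survivor_prob = 342 / 891

p_exactly_42 = binomial_pmf(42, 100, survivor_prob)
print(round(p_exactly_42, 4))
```

The result should agree with the Out[10] value to within floating-point error.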

Question 5 (continued)

What is the probability that at least 42 of those 100 passengers survive?



In [11]:

    
# "At least 42" means P(X >= 42) = 1 - P(X <= 41), so the CDF is evaluated at 41.
prob_at_least_42 = 1 - binom.cdf(41, 100, survivor_prob)
print("%.4f" % prob_at_least_42)



0.2594

There is a 0.259 probability that at least 42 of those 100 passengers survive.
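
The difference between "at least 42" and "more than 42" is exactly one term of the distribution, P(X = 42). A minimal sketch that checks this by summing the probability mass function directly, assuming a survival probability of 342/891; `binomial_pmf` is a hypothetical helper, not a SciPy function.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 100
p = 342 / 891  # assumed survival probability (0.3838...)

# "At least 42" includes k = 42; "more than 42" starts at k = 43.
p_at_least_42 = sum(binomial_pmf(k, n, p) for k in range(42, n + 1))
p_more_than_42 = sum(binomial_pmf(k, n, p) for k in range(43, n + 1))

print(round(p_at_least_42, 4))
print(round(p_more_than_42, 4))
print(round(p_at_least_42 - p_more_than_42, 4))  # the gap is P(X = 42)
```

In SciPy terms, `1 - binom.cdf(41, n, p)` corresponds to the first sum and `1 - binom.cdf(42, n, p)` to the second.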

Question 6

Is there a statistically significant difference between the age of male and female survivors?

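COMPENSATION: The male and female survivor groups are filtered the same way: any age that matches the mean to three decimal places is dropped, so the mean-filled values do not pull both groups toward the same center. The p-value increased after this compensation. A higher p-value does not by itself validate the filtering, but the test now compares recorded ages only.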



In [12]:

    
from scipy.stats import ttest_ind

survivors_male = survivors[(survivors.Sex == 'male') & (round(survivors.Age,3) != round(age_mean, 3)) ]
survivors_female = survivors[(survivors.Sex == 'female') & (round(survivors.Age, 3) != round(age_mean, 3))]
t_stat, p_value = ttest_ind(survivors_male.Age, survivors_female.Age)

print("Results:\n\tt-statistic: %.5f\n\tp-value: %.5f" % (t_stat, p_value))









    



Results:
	t-statistic: -0.83512
	p-value: 0.40434

There is no statistically significant difference between the ages of male and female survivors: the p-value (0.404) is greater than 0.05, so the null hypothesis of equal mean ages is not rejected.
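
By default, `ttest_ind` runs Student's two-sample t-test, which pools the variance of the two groups. A minimal sketch of that statistic using only the standard library, on made-up samples; the function name `pooled_t_statistic` and the sample values are hypothetical, and the p-value step (which needs the t distribution) is left to SciPy.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(a, b):
    """Student's two-sample t statistic and degrees of freedom (equal variances assumed)."""
    n_a, n_b = len(a), len(b)
    # Pooled sample variance: weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    std_err = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / std_err, n_a + n_b - 2

# Hypothetical ages for two small groups.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]

t_stat, dof = pooled_t_statistic(group_a, group_b)
print(t_stat, dof)
```

A t statistic close to zero relative to its degrees of freedom, as in the survivor comparison above, corresponds to a large p-value.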

COMPENSATION: The age distribution plot of survivors below also uses the filtered ages, which removes an artificial spike of survivors in the 25-30 age range.



In [13]:

    
plt.figure(figsize=(10, 4))
opacity = 0.5

plt.hist(survivors_male.Age, bins=np.arange(0, 90, 5), alpha=opacity, label="Males")
plt.hist(survivors_female.Age, bins=np.arange(0, 90, 5), alpha=opacity, label="Females")
plt.legend()
plt.title("Age Distribution of Female and Male Survivors")
plt.xlabel("Ages")
plt.ylabel("Number of Survivors")
plt.show()

Question 7

Is there a statistically significant difference between the fares paid by passengers from Queenstown and the passengers from Cherbourg?



In [14]:

    
from scipy.stats import ttest_ind
fare_from_q = titanic_data[titanic_data.Embarked == 'Q']
fare_from_c = titanic_data[titanic_data.Embarked == 'C']
t_stat, p_value = ttest_ind(fare_from_q.Fare, fare_from_c.Fare)

print("Results:\n\tt-statistic: %.5f\n\tp-value: %g" % (t_stat, p_value))









    



Results:
	t-statistic: -4.84439
	p-value: 2.26359e-06

There is a statistically significant difference between the fares paid by passengers from Queenstown and passengers from Cherbourg. This is indicated by the p-value (about 2.26e-06), which is less than 0.01.
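
The fare test uses the default equal-variance form of `ttest_ind`. When two groups may have quite different spreads, Welch's t-test is a common alternative, and SciPy exposes it through `ttest_ind(..., equal_var=False)`. This notebook did not run that variant; the sketch below only illustrates the statistic on made-up fares, with a hypothetical helper name `welch_t_statistic`.

```python
from math import sqrt
from statistics import mean, variance

def welch_t_statistic(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    # Per-group variance of the sample mean; the variances are NOT pooled.
    var_mean_a = variance(a) / len(a)
    var_mean_b = variance(b) / len(b)
    t_stat = (mean(a) - mean(b)) / sqrt(var_mean_a + var_mean_b)
    dof = (var_mean_a + var_mean_b) ** 2 / (
        var_mean_a ** 2 / (len(a) - 1) + var_mean_b ** 2 / (len(b) - 1)
    )
    return t_stat, dof

# Hypothetical fares: the second group is both higher and more spread out.
fares_low = [1, 2, 3, 4, 5]
fares_high = [2, 4, 6, 8, 10]

t_stat, dof = welch_t_statistic(fares_low, fares_high)
print(round(t_stat, 3), round(dof, 2))
```

The degrees of freedom come out lower than the pooled value of n1 + n2 - 2 when the variances differ, which makes the test more conservative.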



In [15]:

    
plt.figure(figsize=(10, 4))
opacity = 0.5

plt.hist(fare_from_q.Fare, bins=np.arange(0, 90, 5), alpha=opacity, label="Queenstown")
plt.hist(fare_from_c.Fare, bins=np.arange(0, 90, 5), alpha=opacity, label="Cherbourg")
plt.legend()
plt.title("Fare Distribution for Queenstown and Cherbourg Passengers")
plt.xlabel("Fare Price")
plt.ylabel("Number of Passengers")
plt.show()
